Multimodal Indexing of Presentation Videos
This thesis presents four novel methods to help users efficiently and effectively retrieve information from unstructured and unsourced multimedia sources, in particular the increasing amount and variety of presentation videos such as those in e-learning, conference recordings, corporate talks, and student presentations. We demonstrate a system to summarize, index, and cross-reference such videos, and measure the quality of the produced indexes as perceived by the end users. We introduce four major semantic indexing cues: text, speaker faces, graphics, and mosaics, going beyond standard tag-based searches and simple video playback. This work aims at recognizing visual content "in the wild", where the system cannot rely on any additional information besides the video itself.

For text, within a scene-text detection and recognition framework, we present a novel locally optimal adaptive binarization algorithm, implemented with integral histograms. It determines an optimal threshold that maximizes the between-class variance within a subwindow, with computational complexity independent of the size of the window itself. We obtain character recognition rates of 74%, validated against the ground truth of 8 presentation videos spanning over 1 hour and 45 minutes, which almost doubles the baseline performance of an open-source OCR engine.

For speaker faces, we detect, track, match, and finally select a humanly preferred face icon per speaker, based on three quality measures: resolution, amount of skin, and pose. We register an 87% accordance (51 out of 58 speakers) between the face indexes automatically generated from three unstructured presentation videos of approximately 45 minutes each and human preferences recorded through Mechanical Turk experiments.

For diagrams, we locate graphics inside frames showing a projected slide, cluster them with an online algorithm based on a combination of visual and temporal information, and select and color-correct their representatives to match human preferences recorded through Mechanical Turk experiments. We register 71% accuracy (57 out of 81 unique diagrams properly identified, selected, and color-corrected) on three hours of video containing five different presentations.

For mosaics, we combine two existing stitching measures to extend video images into a world coordinate system. A set of frames to be registered into a mosaic is sampled according to the PTZ camera movement, which is computed through least-squares estimation starting from the luminance-constancy assumption. A local-feature-based stitching algorithm is then applied to estimate the homographies among the video frames, and median blending is used to render pixels in overlapping regions of the mosaic.

For two of these indexes, namely faces and diagrams, we present two novel MTurk-derived user data collections to determine viewer preferences, and show that our methods match them in selection. The net result of this thesis is to let users search, inside a video collection as well as within a single video clip, for a segment of a presentation by professor X on topic Y containing graph Z.
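As a rough illustration of the mosaicing step only (the PTZ-based frame sampling and least-squares motion estimation are omitted), the sketch below estimates homographies into the first frame's plane from local features and renders overlapping pixels by median blending. The choice of OpenCV ORB features, RANSAC, and a fixed canvas size are our assumptions, not details taken from the thesis.

```python
# Minimal mosaicing sketch: pairwise homographies from local features,
# median blending in overlaps. ORB/RANSAC and canvas size are assumptions.
import cv2
import numpy as np

def to_gray(img):
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img

def homographies_to_first(frames):
    """Homography mapping each frame into the first frame's plane."""
    orb = cv2.ORB_create(2000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    kp0, des0 = orb.detectAndCompute(to_gray(frames[0]), None)
    homs = [np.eye(3)]
    for f in frames[1:]:
        kp, des = orb.detectAndCompute(to_gray(f), None)
        matches = sorted(matcher.match(des, des0), key=lambda m: m.distance)[:200]
        src = np.float32([kp[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kp0[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # needs >= 4 matches
        homs.append(H)
    return homs

def median_mosaic(frames, canvas=(1080, 3840)):
    """Warp every frame onto one canvas; per-pixel median blends the overlaps."""
    layers = []
    for f, H in zip(frames, homographies_to_first(frames)):
        warped = cv2.warpPerspective(f.astype(np.float32), H, (canvas[1], canvas[0]))
        warped[warped == 0] = np.nan  # crude "empty pixel" mark for this sketch
        layers.append(warped)
    return np.nan_to_num(np.nanmedian(np.stack(layers), axis=0)).astype(np.uint8)
```

Median blending is a natural fit here because it suppresses transient foreground objects, such as a speaker crossing in front of the slide, that appear in only a minority of the overlapping frames.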
A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models
Large language models have become a vital component in modern NLP, achieving state-of-the-art performance in a variety of tasks. However, they are often inefficient for real-world deployment due to their expensive inference costs. Knowledge distillation is a promising technique to improve their efficiency while retaining most of their effectiveness. In this paper, we reproduce, compare, and analyze several representative methods for task-agnostic (general-purpose) distillation of Transformer language models: Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer-mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2. Through extensive experiments, we study the effectiveness of each method for various student architectures in both monolingual (English) and multilingual settings. Overall, we show that MHA transfer based on MiniLMv2 is generally the best option for distillation and explain the potential reasons behind its success. Moreover, we show that HS transfer remains a competitive baseline, especially under a sophisticated layer-mapping strategy, while OD transfer consistently lags behind the other approaches. Findings from this study helped us deploy efficient yet effective student models for latency-critical applications. (Comment: Accepted to EMNLP 2023 Industry Track)
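For intuition, here is a minimal PyTorch sketch of the two simplest objectives compared in such studies: OD transfer as a temperature-scaled KL divergence between output distributions, and HS transfer as an MSE between projected student states and teacher states under a chosen layer mapping. The temperature, the learned projection, and the mapping are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative task-agnostic distillation losses (not the paper's exact recipe).
import torch
import torch.nn.functional as F

def od_loss(student_logits, teacher_logits, T=2.0):
    """Output Distribution transfer: KL between softened output distributions."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def hs_loss(student_states, teacher_states, proj, mapping):
    """Hidden State transfer: MSE between projected student and teacher layers.

    mapping[i] = teacher layer matched to student layer i (a design choice);
    proj is a learned linear layer bridging the two hidden sizes.
    """
    return sum(F.mse_loss(proj(student_states[i]), teacher_states[j])
               for i, j in enumerate(mapping)) / len(mapping)
```

MHA transfer in MiniLMv2 instead matches the teacher's and student's self-attention relation distributions (query-query, key-key, and value-value), which we omit here for brevity.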
Semantic keyword extraction via adaptive text binarization of unstructured unsourced video
We propose a fully automatic method for summarizing and indexing unstructured presentation videos based on text extracted from the projected slides. We use changes of text in the slides as a means to segment the video into semantic shots. Unlike previous approaches, our method does not depend on the availability of the electronic source of the slides, but rather extracts and recognizes the text directly from the video. Once text regions are detected within keyframes, a novel binarization algorithm, Local Adaptive Otsu (LOA), is employed to deal with the low quality of video scene text, before feeding the regions to the open-source Tesseract OCR engine for recognition. We tested our system on a corpus of 8 presentation videos totaling 1 hour and 45 minutes, achieving character recognition rates of 0.5343 precision and 0.7446 recall, and word recognition rates of 0.4947 precision and 0.6651 recall. Besides being used for multimedia document indexing, topic indexing, and cross-referencing, our system can be integrated into summarization and presentation tools.
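A minimal sketch of the idea behind a locally adaptive Otsu: for each pixel, take the grayscale histogram of a surrounding window, pick the threshold that maximizes Otsu's between-class variance inside that window, and use an integral histogram so each window's histogram costs the same regardless of window size. The window size, the 256-bin histogram, and the unoptimized per-pixel loop are our assumptions for illustration, not the paper's implementation.

```python
# Sketch of locally adaptive Otsu via an integral histogram (illustrative;
# the dense one-hot volume is memory-hungry and meant only to show the idea).
import numpy as np

def integral_histogram(gray, bins=256):
    """ih[y, x, b] = count of pixels with value b in gray[:y, :x] (zero-padded)."""
    h, w = gray.shape
    onehot = np.zeros((h, w, bins), dtype=np.int32)
    onehot[np.arange(h)[:, None], np.arange(w)[None, :], gray] = 1
    ih = onehot.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ih, ((1, 0), (1, 0), (0, 0)))  # leading zero row/column

def otsu_threshold(hist):
    """Threshold maximizing between-class variance for one local histogram."""
    total = hist.sum()
    levels = np.arange(hist.size)
    w0 = np.cumsum(hist)                       # background weight per threshold
    w1 = total - w0                            # foreground weight per threshold
    cum_mass = np.cumsum(hist * levels)
    mu0 = cum_mass / np.maximum(w0, 1)
    mu1 = (cum_mass[-1] - cum_mass) / np.maximum(w1, 1)
    return int((w0 * w1 * (mu0 - mu1) ** 2).argmax())

def local_adaptive_otsu(gray, win=31):
    """Binarize a uint8 image with per-pixel Otsu over a win x win neighborhood."""
    ih = integral_histogram(gray)
    out = np.zeros_like(gray)
    r = win // 2
    h, w = gray.shape
    for y in range(h):
        for x in range(w):
            y0, y1 = max(y - r, 0), min(y + r, h - 1)
            x0, x1 = max(x - r, 0), min(x + r, w - 1)
            # Histogram of the window in O(1) via inclusion-exclusion.
            hist = ih[y1 + 1, x1 + 1] - ih[y0, x1 + 1] - ih[y1 + 1, x0] + ih[y0, x0]
            out[y, x] = 255 if gray[y, x] > otsu_threshold(hist) else 0
    return out
```

The binarized regions would then go to the OCR engine, e.g. via pytesseract.image_to_string in a modern Python setup (our example interface, not the paper's).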
CoSiNES: Contrastive Siamese Network for Entity Standardization
Entity standardization maps noisy mentions from free-form text to standard entities in a knowledge base. The unique challenge of this task relative to other entity-related tasks is the lack of surrounding context and the numerous variations in the surface form of the mentions, especially when it comes to generalization across domains where labeled data is scarce. Previous research mostly focuses on developing models that either rely heavily on context or are dedicated solely to a specific domain. In contrast, we propose CoSiNES, a generic and adaptable framework with a Contrastive Siamese Network for Entity Standardization that effectively adapts a pretrained language model to capture the syntax and semantics of the entities in a new domain.

We construct a new dataset in the technology domain, which contains 640 technical stack entities and 6,412 mentions collected from industrial content management systems. We demonstrate that CoSiNES yields higher accuracy and faster runtime than baselines derived from leading methods in this domain. CoSiNES also achieves competitive performance on four standard datasets from the chemistry, medicine, and biomedical domains, demonstrating its cross-domain applicability. (Comment: Accepted by the Matching Workshop at ACL 2023)
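As a rough sketch of the Siamese-contrastive idea (our reading of the abstract, not the authors' released code): encode mention and candidate entity with the same pretrained encoder, compare the embeddings by cosine similarity, and train with a contrastive loss that pulls matched pairs together and pushes mismatched pairs apart. The encoder checkpoint, mean pooling, and margin below are assumptions.

```python
# Sketch of a contrastive Siamese encoder for entity standardization
# (illustrative; encoder choice, pooling, and margin are assumptions).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Shared-weight encoder: mean-pool the last hidden states."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

def contrastive_loss(mentions, entities, labels, margin=0.5):
    """labels[i] = 1 if mentions[i] matches entities[i], else 0."""
    sim = F.cosine_similarity(embed(mentions), embed(entities))
    pos = labels * (1 - sim)                   # pull matched pairs together
    neg = (1 - labels) * F.relu(sim - margin)  # push non-matches below the margin
    return (pos + neg).mean()

# At inference, a mention is standardized to the entity with the nearest embedding.
```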
Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models
Large pretrained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. However, KD can be ineffective when the student is manually selected from a set of existing options, since it can be a sub-optimal choice within the space of all possible student architectures. We develop multilingual KD-NAS, the use of Neural Architecture Search (NAS) guided by KD to find the optimal student architecture for task-agnostic distillation from a multilingual teacher. In each episode of the search process, a NAS controller predicts a reward based on the distillation loss and the latency of inference. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture(s) with the highest reward are selected and distilled on the full training corpus. KD-NAS can automatically trade off efficiency and effectiveness, and recommends architectures suitable for various latency budgets. Using our multi-layer hidden state distillation process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% of its performance, and has been deployed in 3 software offerings requiring large throughput, low latency, and deployment on CPU. (Comment: 11 pages, 5 figures)
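The abstract does not give the controller's reward formula, so the sketch below shows only one plausible shape: a reward that decreases with distillation loss and penalizes candidates exceeding a latency budget. The functional form, exponents, and budget are invented for illustration.

```python
# Hypothetical KD-NAS-style reward; the exact formula is not in the abstract.
def kd_nas_reward(distill_loss, latency_ms, budget_ms=20.0, alpha=1.0, beta=0.5):
    """Higher is better: low distillation loss, latency within the budget."""
    quality = (1.0 / (1.0 + distill_loss)) ** alpha        # maps loss to (0, 1]
    efficiency = min(1.0, budget_ms / latency_ms) ** beta  # penalizes slow models
    return quality * efficiency

# Rank candidate students sampled by the controller in one search episode
# (architecture names, losses, and latencies below are made up):
candidates = [("6L-384H", 0.82, 9.5), ("4L-512H", 0.95, 7.1), ("8L-768H", 0.61, 28.0)]
ranked = sorted(candidates, key=lambda c: kd_nas_reward(c[1], c[2]), reverse=True)
```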
NASTransfer: Analyzing Architecture Transferability in Large Scale Neural Architecture Search
Neural Architecture Search (NAS) is an open and challenging problem in machine learning. While NAS offers great promise, the prohibitive computational demands of most existing NAS methods make it difficult to search for architectures directly on large-scale tasks. The typical way of conducting large-scale NAS is to search for an architectural building block on a small dataset (either a proxy set drawn from the large dataset or a completely different small-scale dataset) and then transfer the block to the larger dataset. Despite a number of recent results that show the promise of transfer from proxy datasets, a comprehensive evaluation of different NAS methods that studies the impact of different source datasets has not yet been conducted. In this work, we analyze the architecture transferability of different NAS methods by performing a series of experiments on large-scale benchmarks such as ImageNet1K and ImageNet22K. We find that: (i) the size and domain of the proxy set do not seem to influence architecture performance on the target dataset; on average, architectures searched using completely different small datasets (e.g., CIFAR10) transfer as well as architectures searched directly on proxy target datasets, although the design of the proxy set has considerable impact on the rankings of different NAS methods; (ii) while different NAS methods show similar performance on a source dataset (e.g., CIFAR10), they differ significantly in transfer performance to a large dataset (e.g., ImageNet1K); and (iii) even on large datasets, a random sampling baseline is very competitive, but choosing the appropriate combination of proxy set and search strategy can provide significant improvement over it. We believe that our extensive empirical analysis will prove useful for the future design of NAS algorithms. (Comment: 19 pages, 19 figures, 6 tables)
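Finding (i) concerns rankings, and a standard way to quantify whether a proxy set preserves them is rank correlation between proxy-set and target-set scores. A minimal sketch using Kendall's tau from SciPy follows; the scores shown are made up for illustration.

```python
# Measuring how well a proxy set preserves architecture rankings (illustrative).
from scipy.stats import kendalltau

# Hypothetical accuracies of five architectures on proxy and target datasets.
proxy_scores  = [0.71, 0.74, 0.69, 0.77, 0.73]   # e.g., searched on CIFAR10
target_scores = [0.78, 0.76, 0.74, 0.81, 0.80]   # e.g., evaluated on ImageNet1K

tau, p_value = kendalltau(proxy_scores, target_scores)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")  # tau near 1: ranking preserved
```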